Announcement

Dubois Challenge 2025



Data Visualization: From Exploration to Communication

Introduction to the Data Visualization Process

The journey of data visualization begins with four fundamental questions:

  1. What data do you have?
  2. What do you want to know about your data?
  3. What visualization methods should you use?
  4. What do you see and does it make sense?

Let’s explore each type of data visualization with practical examples in R.

Types of Data Visualization

1. Categorical Data

Categorical data represents discrete groups or categories. Common visualization methods include:

  • Bar graphs (distribution)
  • Stacked bar charts (parts of a whole)
  • Pie charts (parts of a whole)
  • Treemaps (subcategories)
  • Mosaic plots (multiple variables)

“With categorical data, you often look for the minimum and maximum right away. This gives you a sense of the range of the dataset, and is easily found with a quick sorting of the range of values. After that, look at the distribution of the parts. Are most values high? Low? Somewhere in between? Finally, look for structure and patterns. If a couple of categories have the same value or high differing ones, it’s worth asking why and what makes the categories similar or different, respectively.” - Yau, Data Points, 152-153
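
The first checks described in the quote (sort, find the range, look at the distribution of the parts) can be sketched in base R before any plotting. This is a minimal sketch using the same toy values as the bar plot example in this section; the variable names are illustrative only.

```r
# Toy category totals (illustrative values, not real data)
values <- c(A = 23, B = 45, C = 12, D = 78)

# 1. Sort so the largest and smallest categories are visible at a glance
sorted_values <- sort(values, decreasing = TRUE)

# 2. The range gives the minimum and maximum directly
value_range <- range(values)

# 3. Express each category as a percentage of the total
shares <- round(100 * sorted_values / sum(values), 1)

print(sorted_values)
print(value_range)
print(shares)
```

Sorting before plotting also suggests an order for the bars, which often makes a bar graph easier to read than alphabetical order.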

Example in R:

# Creating a simple bar plot
library(ggplot2)

# Sample data
categories <- c("A", "B", "C", "D")
values <- c(23, 45, 12, 78)
data <- data.frame(categories, values)

# Basic bar plot
ggplot(data, aes(x = categories, y = values)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Simple Bar Plot",
       x = "Categories",
       y = "Values")

# Stacked bar chart example
library(ggplot2)

# Create sample data for stacked bar chart
stacked_data <- data.frame(
  year = rep(2020:2023, each = 3),
  category = rep(c("Product A", "Product B", "Product C"), times = 4),
  sales = c(
    30, 20, 15,  # 2020
    35, 25, 20,  # 2021
    40, 30, 25,  # 2022
    45, 35, 30   # 2023
  )
)

# Create stacked bar chart
ggplot(stacked_data, aes(x = factor(year), y = sales, fill = category)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(
    title = "Sales by Product Category Over Time",
    x = "Year",
    y = "Sales",
    fill = "Product Category"
  )

# To create a 100% stacked bar chart (proportions), just change position to "fill"
ggplot(stacked_data, aes(x = factor(year), y = sales, fill = category)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(
    title = "Product Category Distribution Over Time",
    x = "Year",
    y = "Percentage",
    fill = "Product Category"
  )

# Treemap example with hierarchical structure
library(treemapify)

# Create sample hierarchical data
treemap_data <- data.frame(
    department = rep(c("Sales", "Marketing", "R&D"), times = c(6, 4, 5)),
    team = c(
        # Sales teams
        rep("North", 2), rep("South", 2), rep("West", 2),
        # Marketing teams
        rep("Digital", 2), rep("Traditional", 2),
        # R&D teams
        rep("Product A", 3), rep("Product B", 2)
    ),
    project = c(
        # Sales projects
        "Corporate", "SMB",
        "Corporate", "SMB",
        "Corporate", "SMB",
        # Marketing projects
        "Social", "Email",
        "Print", "TV",
        # R&D projects
        "Research", "Development", "Testing",
        "Research", "Development"
    ),
    value = c(
        # Sales values
        250, 150,
        200, 180,
        220, 160,
        # Marketing values
        120, 90,
        100, 80,
        # R&D values
        150, 180, 120,
        140, 160
    )
)

# Create hierarchical treemap
ggplot(treemap_data, 
       aes(area = value, 
           fill = department,
           subgroup = team,
           label = project)) +
  geom_treemap() +
  geom_treemap_subgroup_border(colour = "white", size = 2) +
  geom_treemap_subgroup_text(place = "centre", 
                            grow = TRUE, 
                            alpha = 0.5, 
                            colour = "black",
                            fontface = "bold") +
  geom_treemap_text(colour = "white", 
                    place = "centre", 
                    size = 10) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Company Structure and Project Distribution",
       subtitle = "By Department, Team, and Project",
       fill = "Department")

# Mosaic plot example
library(vcd)

# Sample data for mosaic plot
data <- data.frame(
  gender = rep(c("M", "F"), each = 100),
  age_group = rep(c("Young", "Middle", "Old"), times = c(70, 80, 50)),
  education = rep(c("High", "Medium", "Low"), times = c(60, 90, 50))
)

mosaic(~ gender + age_group + education, 
       data = data,
       shade = TRUE,
       legend = TRUE)

2. Time Series Data

Time series data shows how variables change over time. Visualization options include:

  • Line charts
  • Bar graphs
  • Dot plots
  • Dot-bar graphs
  • Cycle plots
  • Calendar heatmaps

“Generally speaking, look for changes over time. More specifically, note the nature of the changes. Are the changes relatively a lot or are they small? If they’re small, is the change still significant? Think of possible reasons for what you see over time or sudden blips and if they make sense. The change itself is interesting, but more importantly, you want to know the significance of a change.” - Yau, Data Points, 165

Example in R:

# Creating a line chart with time series data
library(ggplot2)

# Generate sample time series data
set.seed(123)  # fix the seed so the random values are reproducible
dates <- seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "month")
values <- rnorm(12, mean = 100, sd = 10)
ts_data <- data.frame(date = dates, value = values)

# Time series plot
ggplot(ts_data, aes(x = date, y = value)) +
  geom_line(color = "darkblue") +
  geom_point() +
  theme_minimal() +
  labs(title = "Monthly Values Over Time",
       x = "Date",
       y = "Value")
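
To judge whether a change over time is large or small, it can help to compute the change explicitly rather than read it off the chart. The sketch below is one possible approach in base R, using randomly generated values; the 10% threshold is an arbitrary assumption chosen for illustration, not a statistical test of significance.

```r
set.seed(42)  # assumed seed so the sketch is reproducible
ts_data <- data.frame(
  date  = seq(as.Date("2024-01-01"), by = "month", length.out = 12),
  value = rnorm(12, mean = 100, sd = 10)
)

# Month-over-month change: absolute, and relative to the previous month
ts_data$change     <- c(NA, diff(ts_data$value))
ts_data$pct_change <- c(NA, 100 * diff(ts_data$value) / head(ts_data$value, -1))

# Flag months whose relative change exceeds a hypothetical 10% threshold
threshold <- 10
ts_data$large_change <- !is.na(ts_data$pct_change) &
  abs(ts_data$pct_change) > threshold

print(ts_data)
```

The first month has no previous value, so its change is `NA` and it is never flagged.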

3. Spatial Data

Spatial data visualization helps understand geographical patterns through:

  • Maps
  • Cartograms
  • Location maps
  • Connection maps
  • Choropleth maps

“Spatial data is a lot like categorical data, but with a geographic component. You should know the range of the data to start with, and then look for regional patterns. Are there higher or lower values clustered in a certain area of a country or continent? Because a single value only tells you about a small part about a region filled with people, think about what a pattern implies and look to other datasets to verify hunches.” - Yau, Data Points, 176

Example in R:

# Creating a simple choropleth map
library(maps)
library(ggplot2)

# Get US states map data
states_map <- map_data("state")

# Create sample data
set.seed(123)  # fix the seed so the random values are reproducible
state_data <- data.frame(
  state = unique(states_map$region),
  value = runif(length(unique(states_map$region)), 0, 100)
)

# Plot choropleth map
# geom_map() does not use x/y aesthetics, so expand_limits() sets the plot extent
ggplot() +
  geom_map(data = states_map, map = states_map,
           aes(map_id = region),
           fill = "white", color = "grey50") +
  geom_map(data = state_data, map = states_map,
           aes(fill = value, map_id = state),
           color = "grey50") +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c() +
  theme_void() +
  labs(title = "US States Choropleth Map",
       fill = "Value")
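
Knowing the range of the data, as the quote suggests, also helps when deciding how to group values into colour classes. The sketch below shows quantile classification in base R. This technique is not part of the example above; it is one common option for choropleth maps, shown here with made-up values and illustrative class labels.

```r
set.seed(1)  # assumed seed so the sketch is reproducible
state_values <- runif(48, 0, 100)  # made-up values, one per region

# Start with the range of the data
value_range <- range(state_values)

# Quartile breaks place roughly the same number of regions in each class
breaks <- quantile(state_values, probs = seq(0, 1, by = 0.25))
value_class <- cut(state_values, breaks = breaks, include.lowest = TRUE,
                   labels = c("Low", "Mid-low", "Mid-high", "High"))

print(value_range)
print(table(value_class))
```

A classed variable like `value_class` could then be mapped to `fill` in place of the continuous value, at the cost of hiding variation within each class.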

4. Multiple Variables

When dealing with multiple variables, consider:

  • Scatter plots
  • Heat maps
  • Bubble charts
  • Parallel coordinates

“There are a lot of visualization methods that help you explore various aspects of your data, whether it is categories, time, space, or a combination of these. You can visualize the data all at once, but you can also make use of simpler, more straightforward views, which can help extract relationships. Sometimes the relationships are straightforward between two variables, but usually the relationship is complex, especially when you introduce more than two variables. Don’t make assumptions as you explore relationships, and keep in mind there are variables not captured in the data that might contribute to changes. Finally, when it comes to correlation and causation, you need to take in all the context you can before you assign the latter.” - Yau, Data Points, 189

Example in R:

# Creating a scatter plot with multiple variables
library(ggplot2)

# Generate sample data
set.seed(123)
n <- 100
data <- data.frame(
  x = rnorm(n),
  y = rnorm(n),
  group = factor(sample(1:3, n, replace = TRUE)),
  size = runif(n, 1, 10)
)

# Multi-variable scatter plot
ggplot(data, aes(x = x, y = y, color = group, size = size)) +
  geom_point(alpha = 0.6) +
  theme_minimal() +
  labs(title = "Multi-variable Scatter Plot",
       x = "X Variable",
       y = "Y Variable")
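
A scatter plot shows relationships visually; a correlation matrix puts a number on each pairwise linear relationship. The sketch below uses simulated data in which `y` is deliberately constructed to depend on `x` while `z` is independent, so the variable names and coefficients are assumptions made for illustration.

```r
set.seed(123)
n <- 100
x <- rnorm(n)
sample_data <- data.frame(
  x = x,
  y = 0.8 * x + rnorm(n, sd = 0.5),  # constructed to depend on x
  z = rnorm(n)                       # constructed to be independent of x
)

# Pairwise Pearson correlations between all numeric columns
cor_matrix <- round(cor(sample_data), 2)
print(cor_matrix)
```

A high correlation here reflects how the data were generated. With real data it only signals an association, and variables missing from the dataset may still explain it.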

5. Distributions

To visualize data distributions, use:

  • Box plots
  • Violin plots
  • Histograms
  • Density plots
  • Heat maps
  • Surface plots

“Regardless of the type of visualization you use to explore distributions, look for peaks and valleys, range, and the spread of your data, which tell you a lot more than just the mean and median would. The visual analysis of raw data and the variation in between the summary statistics are almost always more interesting, so make use of the opportunity when you get it.” - Yau, Data Points, 199

Example in R:

# Creating multiple distribution plots
library(ggplot2)

# Generate sample data
set.seed(123)  # fix the seed so the random values are reproducible
data <- data.frame(
  group = rep(c("A", "B", "C"), each = 100),
  value = c(rnorm(100), rnorm(100, 1), rnorm(100, 2))
)

# Create violin plot with boxplot inside
ggplot(data, aes(x = group, y = value, fill = group)) +
  geom_violin(alpha = 0.5) +
  geom_boxplot(width = 0.2, alpha = 0.8) +
  theme_minimal() +
  labs(title = "Distribution Comparison",
       x = "Group",
       y = "Value")
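
To see why peaks and valleys say more than the mean and median, compare two simulated samples with similar summary statistics but different shapes. This is a minimal sketch in base R; the sample sizes, means, and bin width are arbitrary choices for illustration.

```r
set.seed(2024)
unimodal <- rnorm(1000, mean = 0, sd = 2)
bimodal  <- c(rnorm(500, mean = -2, sd = 0.5),
              rnorm(500, mean =  2, sd = 0.5))

# The summary statistics look alike for both samples
summary_stats <- data.frame(
  sample = c("unimodal", "bimodal"),
  mean   = c(mean(unimodal), mean(bimodal)),
  median = c(median(unimodal), median(bimodal)),
  sd     = c(sd(unimodal), sd(bimodal))
)
print(summary_stats)

# Bin counts on a shared grid expose the valley in the bimodal sample
bins <- seq(-10, 10, by = 1)
unimodal_counts <- hist(unimodal, breaks = bins, plot = FALSE)$counts
bimodal_counts  <- hist(bimodal,  breaks = bins, plot = FALSE)$counts
print(rbind(unimodal = unimodal_counts, bimodal = bimodal_counts))
```

The bins around zero hold most of the unimodal sample but very little of the bimodal one, a difference the mean and standard deviation do not reveal.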

Principles for Clear Visualization

Visual Hierarchy

  • Use size, color, and position to guide attention
  • Emphasize important elements
  • Create clear relationships between elements
  • Use negative space to separate and frame elements

Readability

  • Use appropriate font sizes and types
  • Maintain consistent spacing
  • Include clear labels and legends

Comparisons

  • Align elements for easy comparison
  • Use consistent scales
  • Group related information

Context

  • Provide necessary background information
  • Include reference points
  • Show relevant time periods or categories

Highlighting

  • Use color strategically
  • Employ contrast effectively
  • Draw attention to key findings
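
One way to apply these highlighting ideas in ggplot2 is to colour a single category and mute the rest. The sketch below reuses the toy values from the bar plot example; the `highlight` column and the colour choices are assumptions made for illustration.

```r
library(ggplot2)

# Toy data (illustrative values)
bar_data <- data.frame(
  category = c("A", "B", "C", "D"),
  value    = c(23, 45, 12, 78)
)

# Mark the category with the largest value as the one to emphasise
bar_data$highlight <- bar_data$value == max(bar_data$value)

highlight_plot <- ggplot(bar_data,
                         aes(x = category, y = value, fill = highlight)) +
  geom_col() +
  # Strong colour for the highlighted bar, neutral grey for the others
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "grey75"),
                    guide = "none") +
  theme_minimal() +
  labs(title = "Category D Has the Highest Value",
       x = "Categories",
       y = "Values")

print(highlight_plot)
```

The legend is suppressed because the title already states what the colour means.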

Annotation

  • Add explanatory text
  • Include statistical concepts where relevant
  • Use clear, concise typography
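
Annotations can be added to a ggplot2 chart with `annotate()`. The sketch below labels the peak of a simulated monthly series and draws the mean as a dashed reference line; the data, seed, and label wording are assumptions made for illustration.

```r
library(ggplot2)

set.seed(7)  # assumed seed so the sketch is reproducible
ts_data <- data.frame(
  date  = seq(as.Date("2024-01-01"), by = "month", length.out = 12),
  value = rnorm(12, mean = 100, sd = 10)
)

# Locate the peak and compute the mean as a reference point
peak <- ts_data[which.max(ts_data$value), ]
avg  <- mean(ts_data$value)

annotated_plot <- ggplot(ts_data, aes(x = date, y = value)) +
  geom_line(color = "darkblue") +
  geom_hline(yintercept = avg, linetype = "dashed", color = "grey40") +
  annotate("point", x = peak$date, y = peak$value,
           color = "firebrick", size = 3) +
  annotate("text", x = peak$date, y = peak$value,
           label = paste0("Peak: ", round(peak$value, 1)),
           vjust = -1) +
  theme_minimal() +
  labs(title = "Monthly Values With Annotated Peak",
       subtitle = "Dashed line shows the mean",
       x = "Date",
       y = "Value")

print(annotated_plot)
```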

Best Practices

  1. Start with a clear purpose
  2. Choose appropriate visualization types
  3. Maintain simplicity
  4. Use color purposefully
  5. Include necessary context
  6. Test for clarity
  7. Iterate based on feedback

Remember to always “do the math”: verify that your visualizations accurately represent the underlying data and statistical concepts.
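
As a small instance of doing the math, the shares drawn by a 100% stacked bar chart can be recomputed directly from the data. The sketch below uses the same sample sales figures as the stacked bar chart example and base R only.

```r
stacked_data <- data.frame(
  year     = rep(2020:2023, each = 3),
  category = rep(c("Product A", "Product B", "Product C"), times = 4),
  sales    = c(30, 20, 15,   # 2020
               35, 25, 20,   # 2021
               40, 30, 25,   # 2022
               45, 35, 30)   # 2023
)

# Cross-tabulate sales by year (rows) and category (columns)
sales_table <- xtabs(sales ~ year + category, data = stacked_data)

# Row-wise proportions: each product's share within its year
shares <- prop.table(sales_table, margin = 1)
print(round(100 * shares, 1))

# Every year's shares must add up to 100%
stopifnot(all(abs(rowSums(shares) - 1) < 1e-9))
```

If the percentages printed here disagree with what the chart appears to show, the chart (or the data preparation behind it) needs another look.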

Conclusion

Effective data visualization is a balance between technical accuracy and clear communication. By following these principles and choosing appropriate visualization methods, you can create compelling and informative visual representations of your data.

Exercise:

1

2

3
